Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

1.5.4 Per Sequence Quality Scores

The per sequence quality score graph plots the mean quality (Phred score) of each read on the x-axis against the number of reads with that mean quality (frequency) on the y-axis. The graph shows whether a subset of reads has an overall low quality. In the ideal curve, the majority of the reads have a mean quality score of 30 or higher, so the peak sits toward the right end of the x-axis (Figure 1.18a). A large number of reads with an overall low quality indicates a systematic problem in the run. A warning is displayed if the most frequently observed mean quality score is below 27 (Figure 1.18b), and an error is displayed if it is below 20. Low-quality reads can be filtered out so that only the reads that pass a quality threshold are kept.
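The calculation behind this graph can be sketched in a few lines of Python. This is a minimal illustration rather than the tool's own implementation: it assumes Phred+33 quality encoding, reads the "majority of reads" as the peak of the histogram (the most frequent mean quality), and uses hypothetical function names (`mean_quality`, `quality_histogram`, `classify`, `filter_reads`) and toy reads invented for the example.

```python
from collections import Counter

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def mean_quality(quality_string, offset=PHRED_OFFSET):
    """Return the mean Phred score of one read's quality string."""
    if not quality_string:
        return 0.0
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def quality_histogram(quality_strings):
    """Count reads per mean quality, truncated to a whole Phred score."""
    return Counter(int(mean_quality(q)) for q in quality_strings)


def classify(histogram, warn_below=27, fail_below=20):
    """Flag a run from the peak (most frequent mean quality) of the histogram."""
    if not histogram:
        raise ValueError("no reads to classify")
    peak = max(histogram, key=histogram.get)
    if peak < fail_below:
        return "fail"
    if peak < warn_below:
        return "warn"
    return "pass"


def filter_reads(records, min_mean_quality=20):
    """Keep (header, sequence, quality) records whose mean quality passes."""
    return [r for r in records if mean_quality(r[2]) >= min_mean_quality]


# Toy records (hypothetical data): 'I' encodes Phred 40 and '#' encodes Phred 2.
records = [
    ("@r1", "ACGT", "IIII"),
    ("@r2", "ACGT", "II##"),
    ("@r3", "ACGT", "####"),
    ("@r4", "ACGT", "IIII"),
]

histogram = quality_histogram(r[2] for r in records)
print(sorted(histogram.items()))  # (mean quality, read count) pairs for plotting
print(classify(histogram))
print([r[0] for r in filter_reads(records, min_mean_quality=20)])
```

The sorted histogram pairs are the points of the curve: mean quality on the x-axis and read count on the y-axis. The filtering threshold of 20 is only an example value; the appropriate cutoff depends on the downstream analysis.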

1.5.5 Per Base Sequence Content

The per base sequence content graph depicts the percentage of each of the four bases (A, C, G, and T) called at each position across all reads in a FASTQ file. The positions are plotted on the x-axis against the base percentage on the y-axis. If there is no bias and the library was sequenced at random, we expect little difference among the proportions of the four bases at each position. The percentage of each base is expected to stay roughly constant across positions (close to 25% when the base composition is balanced), so the four lines run approximately parallel to each other, as shown in Figure 1.19a. Any deviation from this pattern points to a bias or a systematic fault, and some sequences may be overrepresented, as shown in Figure 1.19b. A higher percentage of some bases at the beginning of the x-axis may indicate remnants of adaptor sequences or other contaminating sequences. A warning is displayed if the difference between A and T, or between G and C, is greater than 10% at any position, and the metric fails if that difference is greater than 20% at any position.
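The per-position percentages and the warning and failure thresholds can be illustrated with a short Python sketch. It is an assumption-laden example, not the tool's own code: it measures the difference between A and T and between G and C (the pairing described in the FastQC documentation), ignores any base call other than A, C, G, or T, and uses hypothetical function names (`base_percentages`, `classify_content`) and toy reads.

```python
BASES = "ACGT"


def base_percentages(sequences):
    """Return one {base: percent} dict per read position."""
    longest = max((len(s) for s in sequences), default=0)
    table = []
    for pos in range(longest):
        # Collect the calls at this position, skipping short reads and non-ACGT calls.
        column = [s[pos] for s in sequences if len(s) > pos and s[pos] in BASES]
        total = len(column)
        table.append(
            {b: (100.0 * column.count(b) / total if total else 0.0) for b in BASES}
        )
    return table


def classify_content(table, warn_diff=10.0, fail_diff=20.0):
    """Flag the module from the largest A/T or G/C difference at any position."""
    worst = max(
        (max(abs(p["A"] - p["T"]), abs(p["G"] - p["C"])) for p in table),
        default=0.0,
    )
    if worst > fail_diff:
        return "fail"
    if worst > warn_diff:
        return "warn"
    return "pass"


# Toy reads (hypothetical data): every base appears once at every position.
balanced = ["ACGT", "CATG", "GTAC", "TGCA"]
# Toy reads with an A-only first position, as a leftover adaptor might produce.
biased = ["AAGT", "ACGT", "AGCT", "ATGA"]

for name, reads in (("balanced", balanced), ("biased", biased)):
    table = base_percentages(reads)
    print(name, table[0], classify_content(table))
```

Each dictionary in the returned table is one x-axis position of the graph, and plotting the four values across positions gives the four lines described above.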

1.5.6 Per Sequence GC Content

The per sequence GC content graph plots the number of reads on the y-axis against the GC percentage of each read on the x-axis (Figure 1.20). It depicts the distribution of GC

FIGURE 1.18 Per sequence quality scores.